The real estate domain encompasses the buying, selling, and renting of property. When buying or selling a house, we often rely on a real estate agent to quote a price, since they know which factors are important and how much each affects the price; in other words, they have domain knowledge. This project aims to build a model that helps the agent explain, and the consumer understand, the features that affect house prices the most, as well as the price itself. The paper "Modeling Home Prices Using Realtor Data" is an example of predicting house prices using 12 features extracted from realtor data. In this project we tackle a similar, ongoing Kaggle competition, but with a much larger feature set. The motivation of this project is feature engineering.
# Import libraries necessary for this project
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=1.5)
from IPython.display import display # Allows the use of display() for DataFrames
from time import time
from scipy.stats import skew
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor
import lightgbm as lgb
# Pretty display for notebooks
%matplotlib inline
All the import statements go into the above cell.
In this section we will have preliminary insights about the dataset.
# Read the training and testing set to pandas dataframes.
try:
    train_data = pd.read_csv("train.csv")
    test_data = pd.read_csv("test.csv")
    print("Training set has {} samples with {} features each.".format(*train_data.shape))
    print("Testing set has {} samples with {} features each.".format(*test_data.shape))
except Exception as e:
    print("Dataset could not be loaded. Error: {}".format(e))
    print("Please make sure the training and the testing sets are present.")
Let's have a glance at our training and testing data.
display(train_data.head())
display(test_data.head())
As expected, the target variable 'SalePrice' is not present in the testing set. The dataset has missing values for features such as 'LotFrontage', 'Alley' and 'PoolQC'. Also, since the dataset has an explicit 'Id' feature, let's set it as the index for both dataframes.
train_data.set_index('Id', inplace=True)
test_data.set_index('Id', inplace=True)
display(train_data.head())
display(test_data.tail())
Let's see how many missing values are present in our training and testing data.
# Add the total number of missing values for each feature in the training and testing set.
missing_train = train_data.isnull().sum().sum()
missing_test = test_data.isnull().sum().sum()
print("There are a total of {} missing values out of {} in the training set.".format(missing_train, train_data.shape[0] * train_data.shape[1]))
print("There are a total of {} missing values out of {} in the testing set.".format(missing_test, test_data.shape[0] * test_data.shape[1]))
In the coming sections we will handle the above missing data.
Numerical vs. Categorical
numerical_features = train_data.select_dtypes(include="number").columns
categorical_features = train_data.select_dtypes(include="object").columns
print("There are a total of {} numerical features in the training set.".format(numerical_features.size))
print("There are a total of {} categorical features in the training set.".format(categorical_features.size))
In contrast to the above, a few categorical features have some order/ranking in them, for example 'Utilities'. These will be addressed later in the Preprocessing section. Also, in the coming Feature Reduction section, we will perform detailed Exploratory Data Analysis (EDA).
Let's have some insights about the target variable 'SalePrice'.
train_data.SalePrice.describe()
The above suggests that the target variable 'SalePrice' is skewed. Let's see a distribution of the same.
# distplot is deprecated in recent seaborn; histplot with a KDE overlay is the modern equivalent.
sns.histplot(train_data.SalePrice, kde=True)
print("Feature: SalePrice, Skewness: {:.4f}".format(train_data.SalePrice.skew()))
print("Feature: SalePrice, Kurtosis: {:.4f}".format(train_data.SalePrice.kurt()))
As shown above, the target variable 'SalePrice' is skewed, with almost all houses in the range [34900, 400000]. The high kurtosis also suggests the presence of outliers.
In this section we will handle missing data in the dataset. Values can be filled based on the mean/median/mode of the feature or the most plausible value based on domain knowledge.
Note: Care should be taken when filling values: fill values should be computed from the training set only. For example, if the mode is used to fill missing values in a feature 'f', the mode of the training set should be used, not the mode of the combined training and testing sets.
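The note above can be sketched with toy frames (the feature name and values here are purely illustrative):

```python
import pandas as pd

# Toy train/test frames with gaps in a hypothetical categorical feature 'f'.
df_train = pd.DataFrame({"f": ["A", "A", "B", None]})
df_test = pd.DataFrame({"f": ["B", None, "B"]})

# Compute the mode on the training set only...
fill_value = df_train["f"].mode()[0]

# ...and apply that same value to both sets.
df_train["f"] = df_train["f"].fillna(fill_value)
df_test["f"] = df_test["f"].fillna(fill_value)
print(df_test["f"].tolist())  # ['B', 'A', 'B']
```

This way the testing set never influences the imputation statistic.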
Let's combine the training and testing set and handle missing data simultaneously. Since we need to drop 'SalePrice' from the training set for successful concatenation, let's see if there are any missing values in the same.
# Add the total number of missing values for SalePrice feature.
print("There are {} missing values for SalePrice feature.".format(train_data.SalePrice.isnull().sum()))
# Concatenate training and testing set.
combined_data = pd.concat([train_data.drop('SalePrice', axis=1), test_data])
print("The combined training and testing set has {} samples with {} features each.".format(*combined_data.shape))
display(combined_data.head())
Let's see missing values information for each feature.
missing_data_count = combined_data.isnull().sum().sort_values(ascending=False)
missing_data_percent = (combined_data.isnull().sum()*100.0 / combined_data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([missing_data_count, missing_data_percent], keys=['Count', 'Percent'], axis=1)
missing_data = missing_data[missing_data.Count > 0]
print("There are a total of {} features with missing data as shown below.".format(*missing_data.shape))
display(missing_data)
Let's see a plot of the same.
missing_data.Percent.plot(kind='bar', title="Percentage of missing values grouped by features", fontsize=15, figsize=(20, 8))
Features with more than 95 percent missing values, namely 'PoolQC' and 'MiscFeature', are candidates for feature reduction and will be covered in that section. However, features with one or two missing values can be handled by removing the data points themselves, but only if those data points lie in the training set and number three or fewer. Let's check the latter first.
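Such high-missing features can also be selected programmatically from a frame shaped like `missing_data` above; a sketch with illustrative counts and percentages:

```python
import pandas as pd

# Hypothetical 'Count'/'Percent' columns mirroring the missing_data frame built above.
missing_data = pd.DataFrame(
    {"Count": [2909, 2814, 24, 4], "Percent": [99.66, 96.40, 0.82, 0.14]},
    index=["PoolQC", "MiscFeature", "LotFrontage", "MSZoning"],
)

# Select candidates for removal: features missing in more than 95% of samples.
high_missing = missing_data[missing_data.Percent > 95].index.tolist()
print(high_missing)  # ['PoolQC', 'MiscFeature']
```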
# Find features with 2 or less missing values.
low_missing_data_features = missing_data[missing_data.Count <= 2 ].index
print("Features with 2 or less missing values: {}".format(low_missing_data_features.tolist()))
print("-" * 100)
indexes = combined_data.loc[combined_data[low_missing_data_features].isnull().any(axis=1)].index
print("There are {} data points which contain missing values for the above features: {}".format(indexes.size, indexes.tolist()))
Since there are more than 3 data points containing missing values for these features, we cannot delete them. Additionally, only one of the above data points lies in the training set. Now that we have identified the features with missing values, we should start filling them. Let's take a sample of 10 data points to determine the fill methodology for each feature.
with pd.option_context('display.max_rows', None, 'display.max_columns', 99):
    display(combined_data[missing_data.index].sample(10))
With all the above features with 10 sample values for each, let's start filling missing values for the same.
# Fill missing values for Functional. A NaN value signifies Typical by default.
combined_data.Functional = combined_data.Functional.fillna("Typ")
# Fill missing values for PoolQC, MiscFeature, Alley, Fence, FireplaceQu, MasVnrType, Exterior2nd.
# A NaN value signifies None for the respective feature.
for feature in [ 'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu', 'MasVnrType', 'Exterior2nd' ]:
    combined_data[feature] = combined_data[feature].fillna("None")
# Fill missing values for GarageCond, GarageQual, GarageYrBlt, GarageFinish, GarageType.
# A NaN value signifies no Garage.
for feature in [ 'GarageCond', 'GarageQual', 'GarageYrBlt', 'GarageFinish', 'GarageType' ]:
    combined_data[feature] = combined_data[feature].fillna("None")
# Fill missing values for BsmtCond, BsmtExposure, BsmtQual, BsmtFinType2, BsmtFinType1.
# A NaN value signifies no Basement.
for feature in [ 'BsmtCond', 'BsmtExposure', 'BsmtQual', 'BsmtFinType2', 'BsmtFinType1' ]:
    combined_data[feature] = combined_data[feature].fillna("None")
# Fill missing values for BsmtHalfBath, BsmtFullBath, BsmtFinSF2, BsmtFinSF1, BsmtUnfSF, TotalBsmtSF.
# A NaN value signifies no Basement. Hence 0 for the respective feature.
for feature in [ 'BsmtHalfBath', 'BsmtFullBath', 'BsmtFinSF2', 'BsmtFinSF1', 'BsmtUnfSF', 'TotalBsmtSF' ]:
    combined_data[feature] = combined_data[feature].fillna(0)
# Fill missing values for MasVnrArea, GarageArea, GarageCars. A NaN value signifies 0 for the respective feature.
for feature in [ 'MasVnrArea', 'GarageArea', 'GarageCars' ]:
    combined_data[feature] = combined_data[feature].fillna(0)
# Fill missing values for LotFrontage with the mean, computed on the training set only.
combined_data.LotFrontage = combined_data.LotFrontage.fillna(train_data.LotFrontage.mean())
# Fill missing values for MSZoning, Utilities, Exterior1st, SaleType, Electrical, KitchenQual with the mode, computed on the training set only.
for feature in [ 'MSZoning', 'Utilities', 'Exterior1st', 'SaleType', 'Electrical', 'KitchenQual' ]:
    combined_data[feature] = combined_data[feature].fillna(train_data[feature].mode()[0])
# Verify that there are no missing values.
missing = combined_data.isnull().sum().sum()
print("There are {} missing values remaining in the combined dataset.".format(missing))
Let's reconstruct our training and testing data. Remember that our combined dataset does not contain the 'SalePrice' feature, so it needs to be added back to the reconstructed training set.
# Separate training and testing data based on the number of samples in the respective set.
train_data_sans_missing = pd.concat([combined_data[:train_data.shape[0]], pd.DataFrame(train_data.SalePrice)], axis=1)
test_data_sans_missing = combined_data[train_data.shape[0]:]
display(train_data_sans_missing.head())
display(test_data_sans_missing.head())
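As a side note, the positional split above relies on `pd.concat` preserving row order. An index-based split is a slightly more robust alternative; a minimal sketch with toy frames:

```python
import pandas as pd

# Toy frames standing in for the training and testing sets, with distinct indices.
train = pd.DataFrame({"x": [1, 2]}, index=[1, 2])
test = pd.DataFrame({"x": [3, 4]}, index=[3, 4])
combined = pd.concat([train, test])

# Recover each split via its original index rather than positional slicing.
train_back = combined.loc[train.index]
test_back = combined.loc[test.index]
print(train_back.equals(train), test_back.equals(test))  # True True
```

Because 'Id' is the dataframe index in this notebook, the same `.loc` pattern would work on the real data.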
In this section we will analyze outliers and remove them if necessary. We will use the interquartile range (IQR) method to detect outliers. Additionally, the author of the dataset advises removing any house with 'GrLivArea' greater than 4000 square feet to eliminate all potential outliers.
Note: Since IQR works with continuous numerical features, we should be careful when selecting the same.
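For reference, the IQR rule applied below can be illustrated on a toy sample:

```python
import numpy as np

# Toy continuous sample with one extreme value.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 100.0])

# Points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers.
q1, q3 = np.percentile(values, [25, 75])
step = 1.5 * (q3 - q1)
outliers = values[(values < q1 - step) | (values > q3 + step)]
print(outliers)  # [100.]
```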
Let's perform Interquartile Range (IQR) on the training set for all the numerical features which are continuous.
continuous_numerical_features = [ 'LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
'1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'SalePrice' ]
# Collect the outliers of each feature in a dataframe for further analysis.
# (Rows are gathered in a list first, since DataFrame.append was removed in recent pandas.)
columns = ["feature", "count", "indices"]
outlier_rows = []
for feature in continuous_numerical_features:
    Q1 = np.percentile(train_data_sans_missing[feature], 25)
    Q3 = np.percentile(train_data_sans_missing[feature], 75)
    step = (Q3 - Q1) * 1.5
    lower_bound = Q1 - step
    upper_bound = Q3 + step
    feature_outliers = train_data_sans_missing[~((train_data_sans_missing[feature] >= lower_bound) & (train_data_sans_missing[feature] <= upper_bound))]
    outlier_rows.append({ "feature": feature, "count": feature_outliers.shape[0], "indices": feature_outliers.index.tolist() })
outliers_dataframe = pd.DataFrame(outlier_rows, columns=columns)
Let's plot the Number of Outliers per feature using the above process.
outliers_dataframe.sort_values(by="count", ascending=False).plot(x="feature", y="count", kind="bar", title="Number of Outliers per feature",
fontsize=15, figsize=(20, 8), legend=False)
Since a large number of outliers is present for each feature, we need to select the data points that are outliers for the most features. Let's look at 15 outlier indices and the number of features each affects.
# Find outliers which affect most features.
display(pd.DataFrame(outliers_dataframe.indices.sum(), columns=["feature_count"]).feature_count.value_counts().nlargest(15).to_frame())
As shown above, if we chose to remove outliers that affect at least 6 features, we would be removing 11 data points, which seems a lot considering we originally applied IQR to 20 features. However, there are only 5 data points that are outliers for at least 7 features, and these become our final outliers.
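The final indices, hard-coded in the next cell, could equally be derived from the collected per-feature outlier indices; a toy sketch (illustrative index values, not the real ones):

```python
from collections import Counter

# Toy version of the per-feature outlier index lists collected above.
indices_per_feature = [[5, 7, 9], [5, 7], [5, 9], [5, 7, 9], [5], [5, 7, 9], [5, 7, 9]]

# Count how many features flag each data point as an outlier...
counts = Counter(i for lst in indices_per_feature for i in lst)

# ...and keep points flagged by at least `threshold` features.
threshold = 6
final_outliers = sorted(i for i, c in counts.items() if c >= threshold)
print(final_outliers)  # [5]
```

Deriving the indices this way keeps the notebook reproducible if the threshold is changed.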
outliers_indices = [1299, 524, 1183, 692, 198]
Let's see how our results compare to the author's suggestion of removing data points with 'GrLivArea' greater than 4000.
# Find houses with GrLivArea greater than 4000.
display(train_data_sans_missing.loc[(train_data_sans_missing.GrLivArea > 4000), ["GrLivArea", "SalePrice"]])
IQR did a good job of finding outliers: the final set contains all the data points with 'GrLivArea' greater than 4000, in line with the author's suggestion. Let's remove the outliers.
print("Pre Outlier removal: There are {} samples in the training set.".format(*train_data_sans_missing.shape))
train_data_sans_missing.drop(train_data_sans_missing.loc[outliers_indices].index, inplace=True)
print("Post Outlier removal: There are {} samples in the training set.".format(*train_data_sans_missing.shape))
Save the updated training and testing set to disk.
# Save the updated training set to file
filename = "chkp01_train.csv"
is_present = glob.glob(filename)
if not is_present:
    train_data_sans_missing.to_csv(filename, index=True)
    print("Updated training set written to file: {}".format(filename))
else:
    print("File: {} already present. Make sure to delete it when re-running the whole notebook.".format(filename))
# Save the updated testing set to file
filename = "chkp01_test.csv"
is_present = glob.glob(filename)
if not is_present:
    test_data_sans_missing.to_csv(filename, index=True)
    print("Updated testing set written to file: {}".format(filename))
else:
    print("File: {} already present. Make sure to delete it when re-running the whole notebook.".format(filename))
This checkpoint serves the updated training and testing set with no missing values and outliers removed.
Load the updated dataset.
# Read the training and testing set to pandas dataframes.
try:
    train_data = pd.read_csv("chkp01_train.csv")
    test_data = pd.read_csv("chkp01_test.csv")
    train_data.set_index('Id', inplace=True)
    test_data.set_index('Id', inplace=True)
    print("Training set has {} samples with {} features each.".format(*train_data.shape))
    print("Testing set has {} samples with {} features each.".format(*test_data.shape))
except Exception as e:
    print("Dataset could not be loaded. Error: {}".format(e))
    print("Please make sure the training and the testing sets are present.")
In this section we will convert feature values to their correct data types: Categorical (Ordinal and Nominal) and Numerical (Discrete and Continuous).
Let's combine the training and testing sets as before to handle them simultaneously. Since 'SalePrice' will be dropped from the training set for concatenation, let's check whether its data type is correct.
train_data.SalePrice.dtype
It should be float64; the cast can be performed while reading in the data next time.
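A minimal sketch of requesting the dtype at read time (toy in-memory CSV for illustration):

```python
import io
import pandas as pd

# Two illustrative rows standing in for the checkpoint file.
csv = io.StringIO("Id,SalePrice\n1,208500\n2,181500\n")

# Request float64 for SalePrice while reading instead of casting afterwards.
df = pd.read_csv(csv, dtype={"SalePrice": "float64"})
print(df.SalePrice.dtype)  # float64
```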
# Concatenate training and testing set.
combined_data = pd.concat([train_data.drop('SalePrice', axis=1), test_data])
print("The combined training and testing set has {} samples with {} features each.".format(*combined_data.shape))
Now that we have our combined dataset, let's map feature values.
mapped = {
"Street": { "Grvl" : 1, "Pave" : 2 },
"Alley" : { "None" : 0, "Grvl" : 1, "Pave" : 2 },
"LotShape" : { "IR3" : 1, "IR2" : 2, "IR1" : 3, "Reg" : 4 },
"Utilities": { "ELO" : 1, "NoSeWa" : 2, "NoSewr" : 3, "AllPub" : 4 },
"LandSlope": { "Sev" : 1, "Mod" : 2, "Gtl" : 3 },
"ExterQual" : { "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5 },
"ExterCond" : { "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5 },
"BsmtQual": { "None" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5 },
"BsmtCond" : { "None" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5 },
"BsmtExposure" : { "None" : 0, "No" : 1, "Mn" : 2, "Av" : 3, "Gd" : 4 },
"BsmtFinType1" : { "None" : 0, "Unf" : 1, "LwQ" : 2, "Rec" : 3, "BLQ" : 4, "ALQ" : 5, "GLQ" : 6 },
"BsmtFinType2" : { "None" : 0, "Unf" : 1, "LwQ" : 2, "Rec" : 3, "BLQ" : 4, "ALQ" : 5, "GLQ" : 6 },
"HeatingQC" : { "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5 },
"CentralAir" : { "N" : 0, "Y" : 1 },
"KitchenQual" : { "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5 },
"Functional": { "Sal" : 1, "Sev" : 2, "Maj2" : 3, "Maj1" : 4, "Mod" : 5, "Min2" : 6, "Min1" : 7, "Typ" : 8 },
"FireplaceQu" : { "None" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5 },
"GarageFinish" : { "None" : 0, "Unf" : 1, "RFn" : 2, "Fin" : 3 },
"GarageQual" : { "None" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5 },
"GarageCond" : { "None" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5 },
"PavedDrive": { "N" : 0, "P" : 1, "Y" : 2 },
"PoolQC" : { "None" : 0, "Po" : 1, "Fa" : 2, "TA" : 3, "Gd" : 4, "Ex" : 5 },
"Fence" : { "None" : 0, "MnWw" : 1, "GdWo" : 2, "MnPrv" : 3, "GdPrv" : 4 },
}
combined_data = combined_data.replace(mapped)
# GarageYrBlt is read back with a decimal point because the years were written as floats.
# Since the column has string (object) dtype due to the "None" values, strip the trailing ".0" with a string method.
combined_data.GarageYrBlt = combined_data.GarageYrBlt.str.replace(r"\.0", "", regex=True)
Let's reconstruct our training and testing data. Remember that our combined dataset does not contain the 'SalePrice' feature, so it needs to be added back to the reconstructed training set.
# Separate training and testing data based on the number of samples in the respective set.
train_data_correct_dtypes = pd.concat([combined_data[:train_data.shape[0]], pd.DataFrame(train_data.SalePrice)], axis=1)
test_data_correct_dtypes = combined_data[train_data.shape[0]:]
Save the updated training and testing set to disk.
# Save the updated training set to file
filename = "chkp02_train.csv"
is_present = glob.glob(filename)
if not is_present:
    train_data_correct_dtypes.to_csv(filename, index=True)
    print("Updated training set written to file: {}".format(filename))
else:
    print("File: {} already present. Make sure to delete it when re-running the whole notebook.".format(filename))
# Save the updated testing set to file
filename = "chkp02_test.csv"
is_present = glob.glob(filename)
if not is_present:
    test_data_correct_dtypes.to_csv(filename, index=True)
    print("Updated testing set written to file: {}".format(filename))
else:
    print("File: {} already present. Make sure to delete it when re-running the whole notebook.".format(filename))
This checkpoint serves the updated training and testing set with feature values mapped to their correct data types.
Load the updated dataset. Additionally, this time we will explicitly specify the dtypes of the columns while reading the dataset.
# Note: plain `object` is used below, since the `np.object` alias was removed in recent NumPy.
column_types = {
    'MSSubClass' : np.int64, 'MSZoning' : object, 'LotFrontage' : np.float64, 'LotArea' : np.float64, 'Street' : np.int64,
    'Alley' : np.int64, 'LotShape' : np.int64, 'LandContour' : object, 'Utilities' : np.int64, 'LotConfig' : object,
    'LandSlope' : np.int64, 'Neighborhood' : object, 'Condition1' : object, 'Condition2' : object, 'BldgType' : object,
    'HouseStyle' : object, 'OverallQual' : np.int64, 'OverallCond' : np.int64, 'YearBuilt' : np.int64, 'YearRemodAdd' : np.int64,
    'RoofStyle' : object, 'RoofMatl' : object, 'Exterior1st' : object, 'Exterior2nd' : object, 'MasVnrType' : object,
    'MasVnrArea' : np.float64, 'ExterQual' : np.int64, 'ExterCond' : np.int64, 'Foundation' : object, 'BsmtQual' : np.int64,
    'BsmtCond' : np.int64, 'BsmtExposure' : np.int64, 'BsmtFinType1' : np.int64, 'BsmtFinSF1' : np.float64, 'BsmtFinType2' : np.int64,
    'BsmtFinSF2' : np.float64, 'BsmtUnfSF' : np.float64, 'TotalBsmtSF' : np.float64, 'Heating' : object, 'HeatingQC' : np.int64,
    'CentralAir' : np.int64, 'Electrical' : object, '1stFlrSF' : np.float64, '2ndFlrSF' : np.float64, 'LowQualFinSF' : np.float64,
    'GrLivArea' : np.float64, 'BsmtFullBath' : np.int64, 'BsmtHalfBath' : np.int64, 'FullBath' : np.int64, 'HalfBath' : np.int64,
    'BedroomAbvGr' : np.int64, 'KitchenAbvGr' : np.int64, 'KitchenQual' : np.int64, 'TotRmsAbvGrd' : np.int64, 'Functional' : np.int64,
    'Fireplaces' : np.int64, 'FireplaceQu' : np.int64, 'GarageType' : object, 'GarageYrBlt' : object, 'GarageFinish' : np.int64,
    'GarageCars' : np.int64, 'GarageArea' : np.float64, 'GarageQual' : np.int64, 'GarageCond' : np.int64, 'PavedDrive' : np.int64,
    'WoodDeckSF' : np.float64, 'OpenPorchSF' : np.float64, 'EnclosedPorch' : np.float64, '3SsnPorch' : np.float64, 'ScreenPorch' : np.float64,
    'PoolArea' : np.float64, 'PoolQC' : np.int64, 'Fence' : np.int64, 'MiscFeature' : object, 'MiscVal' : np.float64,
    'MoSold' : np.int64, 'YrSold' : np.int64, 'SaleType' : object, 'SaleCondition' : object, 'SalePrice' : np.float64
}
# Read the training and testing set to pandas dataframes.
try:
    train_data = pd.read_csv("chkp02_train.csv", dtype=column_types)
    test_data = pd.read_csv("chkp02_test.csv", dtype=column_types)
    train_data.set_index('Id', inplace=True)
    test_data.set_index('Id', inplace=True)
    print("Training set has {} samples with {} features each.".format(*train_data.shape))
    print("Testing set has {} samples with {} features each.".format(*test_data.shape))
except Exception as e:
    print("Dataset could not be loaded. Error: {}".format(e))
    print("Please make sure the training and the testing sets are present.")
Let's have a glance at our training and testing data.
display(train_data.head())
display(test_data.head())
In this section we will perform feature reduction to identify the most relevant features for model building. Since we have a large number of features, we create subsets of features that are closely related to each other, guided by domain knowledge. We can then remove unimportant features from each subset while still accounting for what they jointly capture. Below are the feature subsets.
feature_set_sale = [ 'MiscVal', 'MoSold', 'YrSold', 'SaleType', 'SaleCondition' ]
feature_set_position = [ 'Neighborhood', 'Condition1', 'Condition2' ]
feature_set_dwelling = [ 'MSSubClass', 'MSZoning', 'HouseStyle', 'BldgType' ]
feature_set_living_area = [ '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea' ]
feature_set_basement = [ 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF' ]
feature_set_rooms_kitchen_bathrooms = [ 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd' ]
feature_set_amenities = [ 'Utilities', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'Fireplaces', 'FireplaceQu', 'PoolArea', 'PoolQC', 'MiscFeature' ]
feature_set_garage = [ 'GarageType', 'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond' ]
feature_set_porch = [ 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch' ]
feature_set_exterior_finish = [ 'OverallQual', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'ExterQual', 'ExterCond' ]
feature_set_lot = [ 'LotFrontage', 'LotArea', 'LotShape', 'LotConfig' ]
feature_set_miscellaneous = [ 'PavedDrive', 'Street', 'Alley', 'Fence', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'Foundation', 'Functional',
'MasVnrArea', 'MasVnrType', 'LandContour', 'LandSlope' ]
reduced_feature_set = []
candidate_feature_removal = []
Let's have a look at features that are closely related to Sale of the House.
print("feature_set_sale: {}".format(feature_set_sale))
# Append the target variable to get the correlation with the same.
if "SalePrice" not in feature_set_sale:
    feature_set_sale.append("SalePrice")
correlation = train_data[feature_set_sale].corr()
plt.subplots(figsize=(20, 15))
sns.heatmap(correlation, annot=True, annot_kws={"size": 15}, cmap="coolwarm", robust=True)
Let's take a look at some plots.
train_data.loc[:, ["MiscVal", "SalePrice"]].plot(x="MiscVal", y="SalePrice", kind="scatter", title="SalePrice vs MiscVal", fontsize=15, figsize=(14, 8))
'MiscVal' seems like a good candidate for feature removal.
candidate_feature_removal.append("MiscVal")
t = train_data.loc[:, ["MoSold", "SalePrice"]].groupby("MoSold").agg(["mean", "median", "count"]).sort_values([("SalePrice", "median")])
t.loc[:, [("SalePrice", "mean"), ("SalePrice", "median")]].plot(kind="bar", title="SalePrice (ascending median) vs MoSold", fontsize=15, figsize=(14, 8))
# t.loc[:, [("SalePrice", "count")]].plot(kind="bar", title="SalePrice (count) vs MoSold", fontsize=15, figsize=(14, 8))
'MoSold' is slightly correlated to 'SalePrice'.
t = train_data.loc[:, ["YrSold", "SalePrice"]].groupby("YrSold").agg(["mean", "median", "count"])
t.loc[:, [("SalePrice", "mean"), ("SalePrice", "median")]].plot(kind="bar", title="SalePrice vs YrSold", fontsize=15, figsize=(14, 8))
t.loc[:, [("SalePrice", "count")]].plot(kind="bar", title="SalePrice (count) vs YrSold", fontsize=15, figsize=(14, 8))
'YrSold' can be removed from the feature set.
candidate_feature_removal.append("YrSold")
t = train_data.loc[:, ["SaleType", "SalePrice"]].groupby("SaleType").agg(["mean", "median", "count"]).sort_values([("SalePrice", "median")])
t.loc[:, [("SalePrice", "mean"), ("SalePrice", "median")]].plot(kind="bar", title="SalePrice (ascending median) vs SaleType", fontsize=15, figsize=(14, 8))
# t.loc[:, [("SalePrice", "count")]].plot(kind="bar", title="SalePrice (count) vs SaleType", fontsize=15, figsize=(14, 8))
'SaleType' should not be removed.
t = train_data.loc[:, ["SaleCondition", "SalePrice"]].groupby("SaleCondition").agg(["mean", "median", "count"]).sort_values([("SalePrice", "median")])
t.loc[:, [("SalePrice", "mean"), ("SalePrice", "median")]].plot(kind="bar", title="SalePrice (ascending median) vs SaleCondition", fontsize=15, figsize=(14, 8))
# t.loc[:, [("SalePrice", "count")]].plot(kind="bar", title="SalePrice (count) vs SaleCondition", fontsize=15, figsize=(14, 8))
'SaleCondition' should not be removed.
reduced_feature_set.extend([ 'MoSold', 'SaleType', 'SaleCondition' ])
print("Reduced feature set: {}".format(reduced_feature_set))
Let's have a look at features that are closely related to Position of the House.
print("feature_set_position: {}".format(feature_set_position))
t = train_data.loc[:, ["Neighborhood", "SalePrice"]].groupby("Neighborhood").agg(["mean", "median", "count"]).sort_values([("SalePrice", "median")])
t.loc[:, [("SalePrice", "mean"), ("SalePrice", "median")]].plot(kind="bar", title="SalePrice (ascending median) vs Neighborhood", fontsize=15, figsize=(14, 8))
# t.loc[:, [("SalePrice", "count")]].plot(kind="bar", title="SalePrice (count) vs Neighborhood", fontsize=15, figsize=(14, 8))
'Neighborhood' should not be removed.
t = train_data.loc[:, ["Condition1", "SalePrice"]].groupby("Condition1").agg(["mean", "median", "count"]).sort_values([("SalePrice", "median")])
t.loc[:, [("SalePrice", "mean"), ("SalePrice", "median")]].plot(kind="bar", title="SalePrice (ascending median) vs Condition1", fontsize=15, figsize=(14, 8))
# t.loc[:, [("SalePrice", "count")]].plot(kind="bar", title="SalePrice (count) vs Condition1", fontsize=15, figsize=(14, 8))
'Condition1' should not be removed.
t = train_data.loc[:, ["Condition2", "SalePrice"]].groupby("Condition2").agg(["mean", "median", "count"]).sort_values([("SalePrice", "median")])
t.loc[:, [("SalePrice", "mean"), ("SalePrice", "median")]].plot(kind="bar", title="SalePrice (ascending median) vs Condition2", fontsize=15, figsize=(14, 8))
# t.loc[:, [("SalePrice", "count")]].plot(kind="bar", title="SalePrice (count) vs Condition2", fontsize=15, figsize=(14, 8))
'Condition2' should not be removed.
All the features in 'feature_set_position' are kept in the 'reduced_feature_set'.
reduced_feature_set.extend([ 'Neighborhood', 'Condition1', 'Condition2' ])
print("Reduced feature set: {}".format(reduced_feature_set))
Let's have a look at features that are closely related to Dwelling of the House.
print("feature_set_dwelling: {}".format(feature_set_dwelling))
# Since there is only one numerical feature we can directly see the correlation with the target variable.
correlation = train_data["MSSubClass"].corr(train_data["SalePrice"])
print("Correlation between MSSubClass and SalePrice: {:.4f}".format(correlation))
t = train_data.loc[:, ["MSSubClass", "SalePrice"]].groupby("MSSubClass").agg(["mean", "median", "count"]).sort_values([("SalePrice", "median")])
t.loc[:, [("SalePrice", "mean"), ("SalePrice", "median")]].plot(kind="bar", title="SalePrice (ascending median) vs MSSubClass", fontsize=15, figsize=(14, 8))
# t.loc[:, [("SalePrice", "count")]].plot(kind="bar", title="SalePrice (count) vs MSSubClass", fontsize=15, figsize=(14, 8))
'MSSubClass' should not be removed.
t = train_data.loc[:, ["MSZoning", "SalePrice"]].groupby("MSZoning").agg(["mean", "median", "count"]).sort_values([("SalePrice", "median")])
t.loc[:, [("SalePrice", "mean"), ("SalePrice", "median")]].plot(kind="bar", title="SalePrice (ascending median) vs MSZoning", fontsize=15, figsize=(14, 8))
# t.loc[:, [("SalePrice", "count")]].plot(kind="bar", title="SalePrice (count) vs MSZoning", fontsize=15, figsize=(14, 8))
'MSZoning' should not be removed.
t = train_data.loc[:, ["HouseStyle", "SalePrice"]].groupby("HouseStyle").agg(["mean", "median", "count"]).sort_values([("SalePrice", "median")])
t.loc[:, [("SalePrice", "mean"), ("SalePrice", "median")]].plot(kind="bar", title="SalePrice (ascending median) vs HouseStyle", fontsize=15, figsize=(14, 8))
# t.loc[:, [("SalePrice", "count")]].plot(kind="bar", title="SalePrice (count) vs HouseStyle", fontsize=15, figsize=(14, 8))
'HouseStyle' should not be removed.
t = train_data.loc[:, ["BldgType", "SalePrice"]].groupby("BldgType").agg(["mean", "median", "count"]).sort_values([("SalePrice", "median")])
t.loc[:, [("SalePrice", "mean"), ("SalePrice", "median")]].plot(kind="bar", title="SalePrice (ascending median) vs BldgType", fontsize=15, figsize=(14, 8))
# t.loc[:, [("SalePrice", "count")]].plot(kind="bar", title="SalePrice (count) vs BldgType", fontsize=15, figsize=(14, 8))
'BldgType' should not be removed.
All the features in 'feature_set_dwelling' are kept in the 'reduced_feature_set'.
reduced_feature_set.extend([ 'MSSubClass', 'MSZoning', 'HouseStyle', 'BldgType' ])
print("Reduced feature set: {}".format(reduced_feature_set))
Let's have a look at features that are closely related to the Living Area of the House.
print("feature_set_living_area: {}".format(feature_set_living_area))
# Append the target variable to get the correlation with the same.
if "SalePrice" not in feature_set_living_area:
    feature_set_living_area.append("SalePrice")
correlation = train_data[feature_set_living_area].corr()
plt.subplots(figsize=(20,15))
sns.heatmap(correlation, annot=True, annot_kws={"size": 15}, cmap="coolwarm", robust=True)
sns.pairplot(train_data[feature_set_living_area])
From the above correlation and scatter matrix, 'GrLivArea' is clearly a very important feature for predicting 'SalePrice'. '1stFlrSF' and '2ndFlrSF' show high correlation with 'GrLivArea' and thus could be removed; however, since '1stFlrSF' itself shows comparable correlation with 'SalePrice', it is advisable to keep it. 'LowQualFinSF' shows very low correlation with 'SalePrice' and can be removed. Therefore, our final candidates for feature removal are '2ndFlrSF' and 'LowQualFinSF'.
candidate_feature_removal.extend([ '2ndFlrSF', 'LowQualFinSF' ])
reduced_feature_set.extend([ '1stFlrSF', 'GrLivArea' ])
print("Reduced feature set: {}".format(reduced_feature_set))
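The pruning rule applied above (drop a feature that is redundant with a kept feature, or that carries no signal about the target) can be illustrated on a small synthetic frame. This is a toy sketch, not the competition data; the column names merely mirror the roles of 'GrLivArea', '2ndFlrSF' and 'LowQualFinSF'.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1729)
n = 500
# 'GrLivArea' drives the target; '2ndFlrSF' is largely redundant with it;
# 'LowQualFinSF' is pure noise with respect to the target.
grliv = rng.normal(1500, 300, n)
secondflr = grliv * 0.6 + rng.normal(0, 50, n)
lowqual = rng.normal(0, 1, n)
price = grliv * 100 + rng.normal(0, 5000, n)

toy = pd.DataFrame({"GrLivArea": grliv, "2ndFlrSF": secondflr,
                    "LowQualFinSF": lowqual, "SalePrice": price})
corr = toy.corr()
# Redundant: highly correlated with a feature we keep.
redundant = corr.loc["2ndFlrSF", "GrLivArea"] > 0.9
# Irrelevant: negligible correlation with the target.
irrelevant = abs(corr.loc["LowQualFinSF", "SalePrice"]) < 0.2
```

Both conditions hold on the toy frame, which is exactly the reasoning used to mark '2ndFlrSF' and 'LowQualFinSF' for removal.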
Let's have a look at features that are closely related to the Basement of the House.
print("feature_set_basement: {}".format(feature_set_basement))
# Append the target variable to get the correlation with the same.
if "SalePrice" not in feature_set_basement:
feature_set_basement.append("SalePrice")
correlation = train_data[feature_set_basement].corr()
plt.subplots(figsize=(20,15))
sns.heatmap(correlation, annot=True, annot_kws={"size": 15}, cmap="coolwarm", robust=True)
sns.pairplot(train_data[feature_set_basement])
The above correlation and scatter matrix show that this feature set contains many correlated pairs. The pairs ('BsmtCond', 'BsmtQual'), ('TotalBsmtSF', 'BsmtQual'), ('BsmtFinSF1', 'BsmtFinType1') and ('BsmtFinSF2', 'BsmtFinType2') are highly correlated, while 'TotalBsmtSF' shows reasonable correlation with most of the features. Additionally, the scatter matrix shows a clear boundary between 'TotalBsmtSF' and ('BsmtFinSF1', 'BsmtUnfSF'). The nature of those features also suggests that they can be combined. Let's take a look at some samples to see if we can create a new feature out of the current feature set.
display(train_data[feature_set_basement].sample(10))
The table above shows that 'TotalBsmtSF' is indeed split into three features, namely 'BsmtFinSF1', 'BsmtFinSF2' and 'BsmtUnfSF', where 'TotalBsmtSF' is the total of the three. We can remove those three features. Additionally, 'BsmtCond' can be removed since it is highly correlated with 'BsmtQual'. Although 'BsmtQual' is highly correlated with 'TotalBsmtSF', it should not be removed since it is a measure of the overall basement and also has high correlation with 'SalePrice' itself. Regarding 'BsmtFinType1', we should keep it in the 'reduced_feature_set' since we have already removed its correlated paired feature 'BsmtFinSF1'. However, we can remove 'BsmtFinType2' because it shows negligible correlation with the target variable 'SalePrice'.
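The additivity claim can be checked directly rather than eyeballed from samples. As a sketch on toy rows shaped like the basement columns (illustrative values only; the real check would run over `train_data`):

```python
import pandas as pd

# Toy rows mimicking the basement square-footage columns.
toy = pd.DataFrame({
    "BsmtFinSF1":  [700, 0, 350],
    "BsmtFinSF2":  [0,   0, 120],
    "BsmtUnfSF":   [150, 0, 230],
    "TotalBsmtSF": [850, 0, 700],
})
parts = ["BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF"]
# True only when every row's parts sum exactly to the recorded total.
is_additive = toy[parts].sum(axis=1).eq(toy["TotalBsmtSF"]).all()
```

If `is_additive` holds on the full training set, the three component features are strictly redundant given 'TotalBsmtSF'.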
candidate_feature_removal.extend([ 'BsmtCond', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtFinType2', 'BsmtUnfSF' ])
reduced_feature_set.extend([ 'BsmtQual', 'BsmtExposure', 'BsmtFinType1', 'TotalBsmtSF' ])
print("Reduced feature set: {}".format(reduced_feature_set))
Let's have a look at features that are closely related to the Rooms, Kitchens and Bathrooms of the House.
print("feature_set_rooms_kitchen_bathrooms: {}".format(feature_set_rooms_kitchen_bathrooms))
# Append the target variable to get the correlation with the same.
if "SalePrice" not in feature_set_rooms_kitchen_bathrooms:
feature_set_rooms_kitchen_bathrooms.append("SalePrice")
correlation = train_data[feature_set_rooms_kitchen_bathrooms].corr()
plt.subplots(figsize=(20,15))
sns.heatmap(correlation, annot=True, annot_kws={"size": 15}, cmap="coolwarm", robust=True)
sns.pairplot(train_data[feature_set_rooms_kitchen_bathrooms])
'TotRmsAbvGrd' shows high correlation with 'FullBath' and 'BedroomAbvGr'. We can remove 'BedroomAbvGr'. 'FullBath' is kept for now since 'TotRmsAbvGrd' does not include bathrooms according to the data description. 'BsmtHalfBath' shows negligible correlation with the target variable 'SalePrice' and thus can be removed. 'KitchenQual' is an important feature when predicting 'SalePrice'.
candidate_feature_removal.extend([ 'BedroomAbvGr', 'BsmtHalfBath' ])
reduced_feature_set.extend([ 'BsmtFullBath', 'FullBath', 'HalfBath', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd' ])
print("Reduced feature set: {}".format(reduced_feature_set))
Let's have a look at features that are closely related to the Amenities of the House.
print("feature_set_amenities: {}".format(feature_set_amenities))
# Append the target variable to get the correlation with the same.
if "SalePrice" not in feature_set_amenities:
feature_set_amenities.append("SalePrice")
correlation = train_data[feature_set_amenities].corr()
plt.subplots(figsize=(20,15))
sns.heatmap(correlation, annot=True, annot_kws={"size": 15}, cmap="coolwarm", robust=True)
sns.pairplot(train_data[feature_set_amenities])
From the above correlation and scatter matrix, we can see that ('FireplaceQu', 'Fireplaces') and ('PoolQC', 'PoolArea') show very high correlation. We should therefore remove 'Fireplaces' and 'PoolQC'. Additionally we can remove 'Utilities' since it shows very little correlation with 'SalePrice' probably because it is the most skewed variable in the feature set. Let's see how the remaining categorical (nominal) features affect house prices.
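As an aside, the 'Utilities' observation generalizes: a nearly constant feature is both extremely skewed and carries almost no correlation with any target. A toy sketch (synthetic values, not the competition column):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.RandomState(0)
# 999 identical values and a single different one, mimicking a
# nearly constant feature such as 'Utilities'.
near_constant = np.array([1.0] * 999 + [0.0])
target = rng.normal(180000, 40000, 1000)

feature_skew = skew(near_constant)                        # extreme magnitude
feature_corr = np.corrcoef(near_constant, target)[0, 1]   # near zero
```

A feature that barely varies cannot explain variation in 'SalePrice', which is why its correlation with the target is negligible.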
t = train_data.loc[:, ["Heating", "SalePrice"]].groupby("Heating").agg(["mean", "median", "count"]).sort_values([("SalePrice", "median")])
t.loc[:, [("SalePrice", "mean"), ("SalePrice", "median")]].plot(kind="bar", title="SalePrice (ascending median) vs Heating", fontsize=15, figsize=(14, 8))
# t.loc[:, [("SalePrice", "count")]].plot(kind="bar", title="SalePrice (count) vs Heating", fontsize=15, figsize=(14, 8))
'Heating' should not be removed.
t = train_data.loc[:, ["Electrical", "SalePrice"]].groupby("Electrical").agg(["mean", "median", "count"]).sort_values([("SalePrice", "median")])
t.loc[:, [("SalePrice", "mean"), ("SalePrice", "median")]].plot(kind="bar", title="SalePrice (ascending median) vs Electrical", fontsize=15, figsize=(14, 8))
# t.loc[:, [("SalePrice", "count")]].plot(kind="bar", title="SalePrice (count) vs Electrical", fontsize=15, figsize=(14, 8))
'Electrical' should not be removed.
t = train_data.loc[:, ["MiscFeature", "SalePrice"]].groupby("MiscFeature").agg(["mean", "median", "count"]).sort_values([("SalePrice", "median")])
t.loc[:, [("SalePrice", "mean"), ("SalePrice", "median")]].plot(kind="bar", title="SalePrice (ascending median) vs MiscFeature", fontsize=15, figsize=(14, 8))
t.loc[:, [("SalePrice", "count")]].plot(kind="bar", title="SalePrice (count) vs MiscFeature", fontsize=15, figsize=(14, 8))
Remember that 'MiscFeature' had a lot of missing values and therefore should be removed.
candidate_feature_removal.extend([ 'Utilities', 'Fireplaces', 'PoolQC', 'MiscFeature' ])
reduced_feature_set.extend([ 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'FireplaceQu', 'PoolArea' ])
print("Reduced feature set: {}".format(reduced_feature_set))
Let's have a look at features that are closely related to the Garage of the House.
print("feature_set_garage: {}".format(feature_set_garage))
# Append the target variable to get the correlation with the same.
if "SalePrice" not in feature_set_garage:
feature_set_garage.append("SalePrice")
correlation = train_data[feature_set_garage].corr()
plt.subplots(figsize=(20,15))
sns.heatmap(correlation, annot=True, annot_kws={"size": 15}, cmap="coolwarm", robust=True)
sns.pairplot(train_data[feature_set_garage])
From the above correlation and scatter matrix we see that ('GarageArea', 'GarageCars') and ('GarageCond', 'GarageQual') show high correlation. We should therefore remove 'GarageArea' and 'GarageCond' from the feature set. Let's see how the remaining categorical (nominal) features affect house prices.
t = train_data.loc[:, ["GarageType", "SalePrice"]].groupby("GarageType").agg(["mean", "median", "count"]).sort_values([("SalePrice", "median")])
t.loc[:, [("SalePrice", "mean"), ("SalePrice", "median")]].plot(kind="bar", title="SalePrice (ascending median) vs GarageType", fontsize=15, figsize=(14, 8))
# t.loc[:, [("SalePrice", "count")]].plot(kind="bar", title="SalePrice (count) vs GarageType", fontsize=15, figsize=(14, 8))
'GarageType' should not be removed.
t = train_data.loc[:, ["GarageYrBlt", "SalePrice"]].groupby("GarageYrBlt").agg(["mean", "median", "count"])
t.loc[:, [("SalePrice", "mean"), ("SalePrice", "median")]].plot(kind="bar", title="SalePrice vs GarageYrBlt", fontsize=15, figsize=(20, 8))
t.loc[:, [("SalePrice", "count")]].plot(kind="bar", title="SalePrice (count) vs GarageYrBlt", fontsize=15, figsize=(20, 8))
Houses with recently built garages have higher prices. This is likely because those houses themselves were built recently. Since a few houses have no garage, the column's data type is object. However, if we remove those houses, we can compute the correlation between 'GarageYrBlt' and 'YearBuilt'. Let's remove the houses with no garage and look at the correlation with 'YearBuilt'.
# Remove houses with no Garage.
houses_with_garages = train_data.loc[:, ["GarageYrBlt", "YearBuilt", "SalePrice"]].loc[train_data.GarageYrBlt != "None"]
# Make it numerical so that correlation succeeds.
houses_with_garages.GarageYrBlt = houses_with_garages.GarageYrBlt.astype(int)
# Find correlation.
correlation = houses_with_garages.corr()
display(correlation)
As suspected, ('GarageYrBlt', 'YearBuilt') shows high correlation, with 'YearBuilt' having the higher correlation with 'SalePrice'. Therefore we should remove 'GarageYrBlt' from the feature set.
candidate_feature_removal.extend([ 'GarageYrBlt', 'GarageArea', 'GarageCond' ])
reduced_feature_set.extend([ 'GarageType', 'GarageFinish', 'GarageCars', 'GarageQual' ])
print("Reduced feature set: {}".format(reduced_feature_set))
Let's have a look at features that are closely related to the Porch of the House.
print("feature_set_porch: {}".format(feature_set_porch))
# Append the target variable to get the correlation with the same.
if "SalePrice" not in feature_set_porch:
feature_set_porch.append("SalePrice")
correlation = train_data[feature_set_porch].corr()
plt.subplots(figsize=(20,15))
sns.heatmap(correlation, annot=True, annot_kws={"size": 15}, cmap="coolwarm", robust=True)
sns.pairplot(train_data[feature_set_porch])
From the above correlation and scatter matrix, we find that features in this set are fairly independent and show negligible correlation among themselves. However with respect to the target variable 'SalePrice' we see that 'WoodDeckSF' and 'OpenPorchSF' show reasonable correlation whereas 'EnclosedPorch' and 'ScreenPorch' show low negative and low positive correlation respectively. '3SsnPorch' seems irrelevant when predicting house prices and therefore can be removed.
candidate_feature_removal.append('3SsnPorch')
reduced_feature_set.extend([ 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', 'ScreenPorch' ])
print("Reduced feature set: {}".format(reduced_feature_set))
Let's have a look at features that are closely related to the Exterior finish of the House.
print("feature_set_exterior_finish: {}".format(feature_set_exterior_finish))
# Append the target variable to get the correlation with the same.
if "SalePrice" not in feature_set_exterior_finish:
feature_set_exterior_finish.append("SalePrice")
correlation = train_data[feature_set_exterior_finish].corr()
plt.subplots(figsize=(20,15))
sns.heatmap(correlation, annot=True, annot_kws={"size": 15}, cmap="coolwarm", robust=True)
sns.pairplot(train_data[feature_set_exterior_finish])
From the above correlation and scatter matrix we can see that although 'ExterQual' and 'OverallQual' have high correlation, we should keep 'ExterQual' since its correlation with the target variable 'SalePrice' is comparable to that of 'OverallQual'. Surprisingly, 'ExterCond' and 'ExterQual' do not have a high correlation. The relationship between 'ExterCond' and the target variable 'SalePrice' does not seem linear; we see peaks for average houses w.r.t. 'ExterCond'. Let's see how the categorical (nominal) features affect house prices.
t = train_data.loc[:, ["RoofStyle", "SalePrice"]].groupby("RoofStyle").agg(["mean", "median", "count"]).sort_values([("SalePrice", "median")])
t.loc[:, [("SalePrice", "mean"), ("SalePrice", "median")]].plot(kind="bar", title="SalePrice (ascending median) vs RoofStyle", fontsize=15, figsize=(14, 8))
# t.loc[:, [("SalePrice", "count")]].plot(kind="bar", title="SalePrice (count) vs RoofStyle", fontsize=15, figsize=(14, 8))
'RoofStyle' should not be removed.
t = train_data.loc[:, ["RoofMatl", "SalePrice"]].groupby("RoofMatl").agg(["mean", "median", "count"]).sort_values([("SalePrice", "median")])
t.loc[:, [("SalePrice", "mean"), ("SalePrice", "median")]].plot(kind="bar", title="SalePrice (ascending median) vs RoofMatl", fontsize=15, figsize=(14, 8))
# t.loc[:, [("SalePrice", "count")]].plot(kind="bar", title="SalePrice (count) vs RoofMatl", fontsize=15, figsize=(14, 8))
'RoofMatl' should not be removed.
t = train_data.loc[:, ["Exterior1st", "SalePrice"]].groupby("Exterior1st").agg(["mean", "median", "count"]).sort_values([("SalePrice", "median")])
t.loc[:, [("SalePrice", "mean"), ("SalePrice", "median")]].plot(kind="bar", title="SalePrice (ascending median) vs Exterior1st", fontsize=15, figsize=(14, 8))
# t.loc[:, [("SalePrice", "count")]].plot(kind="bar", title="SalePrice (count) vs Exterior1st", fontsize=15, figsize=(14, 8))
'Exterior1st' should not be removed.
t = train_data.loc[:, ["Exterior2nd", "SalePrice"]].groupby("Exterior2nd").agg(["mean", "median", "count"]).sort_values([("SalePrice", "median")])
t.loc[:, [("SalePrice", "mean"), ("SalePrice", "median")]].plot(kind="bar", title="SalePrice (ascending median) vs Exterior2nd", fontsize=15, figsize=(14, 8))
# t.loc[:, [("SalePrice", "count")]].plot(kind="bar", title="SalePrice (count) vs Exterior2nd", fontsize=15, figsize=(14, 8))
'Exterior2nd' should not be removed.
All the features in 'feature_set_exterior_finish' are kept in the 'reduced_feature_set'.
reduced_feature_set.extend([ 'OverallQual', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'ExterQual', 'ExterCond' ])
print("Reduced feature set: {}".format(reduced_feature_set))
Let's have a look at features that are closely related to the Lot of the House.
print("feature_set_lot: {}".format(feature_set_lot))
# Append the target variable to get the correlation with the same.
if "SalePrice" not in feature_set_lot:
feature_set_lot.append("SalePrice")
correlation = train_data[feature_set_lot].corr()
plt.subplots(figsize=(20,15))
sns.heatmap(correlation, annot=True, annot_kws={"size": 15}, cmap="coolwarm", robust=True)
sns.pairplot(train_data[feature_set_lot])
From the above correlation and scatter matrix we can see that none of the features in the current feature set show high correlation with each other. Also, 'SalePrice' seems non-linear w.r.t. 'LotShape'. Let's see how 'LotConfig' affects house prices.
t = train_data.loc[:, ["LotConfig", "SalePrice"]].groupby("LotConfig").agg(["mean", "median", "count"]).sort_values([("SalePrice", "median")])
t.loc[:, [("SalePrice", "mean"), ("SalePrice", "median")]].plot(kind="bar", title="SalePrice (ascending median) vs LotConfig", fontsize=15, figsize=(14, 8))
# t.loc[:, [("SalePrice", "count")]].plot(kind="bar", title="SalePrice (count) vs LotConfig", fontsize=15, figsize=(14, 8))
'LotConfig' should not be removed.
All the features in 'feature_set_lot' are kept in the 'reduced_feature_set'.
reduced_feature_set.extend([ 'LotShape', 'LotFrontage', 'LotArea', 'LotConfig' ])
print("Reduced feature set: {}".format(reduced_feature_set))
Let's have a look at the remaining features of the House.
print("feature_set_miscellaneous: {}".format(feature_set_miscellaneous))
# Append the target variable to get the correlation with the same.
if "SalePrice" not in feature_set_miscellaneous:
feature_set_miscellaneous.append("SalePrice")
correlation = train_data[feature_set_miscellaneous].corr()
plt.subplots(figsize=(20,15))
sns.heatmap(correlation, annot=True, annot_kws={"size": 15}, cmap="coolwarm", robust=True)
sns.pairplot(train_data[feature_set_miscellaneous])
From the above correlation and scatter matrix we can see that 'PavedDrive' shows some correlation with 'YearBuilt' while 'YearRemodAdd' shows high correlation with 'YearBuilt'. We can remove 'YearRemodAdd' from our feature set. Also 'LandSlope' seems irrelevant when predicting house prices and therefore should be removed. Let's see how the remaining categorical (nominal) features affect house prices.
t = train_data.loc[:, ["Foundation", "SalePrice"]].groupby("Foundation").agg(["mean", "median", "count"]).sort_values([("SalePrice", "median")])
t.loc[:, [("SalePrice", "mean"), ("SalePrice", "median")]].plot(kind="bar", title="SalePrice (ascending median) vs Foundation", fontsize=15, figsize=(14, 8))
# t.loc[:, [("SalePrice", "count")]].plot(kind="bar", title="SalePrice (count) vs Foundation", fontsize=15, figsize=(14, 8))
'Foundation' should not be removed.
t = train_data.loc[:, ["MasVnrType", "SalePrice"]].groupby("MasVnrType").agg(["mean", "median", "count"]).sort_values([("SalePrice", "median")])
t.loc[:, [("SalePrice", "mean"), ("SalePrice", "median")]].plot(kind="bar", title="SalePrice (ascending median) vs MasVnrType", fontsize=15, figsize=(14, 8))
# t.loc[:, [("SalePrice", "count")]].plot(kind="bar", title="SalePrice (count) vs MasVnrType", fontsize=15, figsize=(14, 8))
'MasVnrType' should not be removed.
t = train_data.loc[:, ["LandContour", "SalePrice"]].groupby("LandContour").agg(["mean", "median", "count"]).sort_values([("SalePrice", "median")])
t.loc[:, [("SalePrice", "mean"), ("SalePrice", "median")]].plot(kind="bar", title="SalePrice (ascending median) vs LandContour", fontsize=15, figsize=(14, 8))
# t.loc[:, [("SalePrice", "count")]].plot(kind="bar", title="SalePrice (count) vs LandContour", fontsize=15, figsize=(14, 8))
'LandContour' should not be removed.
candidate_feature_removal.extend([ 'YearRemodAdd', 'LandSlope' ])
reduced_feature_set.extend([ 'PavedDrive', 'Street', 'Alley', 'Fence', 'OverallCond', 'YearBuilt', 'Foundation', 'Functional', 'MasVnrArea', 'MasVnrType', 'LandContour' ])
print("Reduced feature set: {}".format(reduced_feature_set))
We have reached the end of our first phase of feature reduction.
print("{} out of {} features selected for removal: {}".format(len(candidate_feature_removal), test_data.shape[1], candidate_feature_removal))
print("-" * 100)
print("Remaining {} out of {} features selected for the reduced feature set: {}".format(len(reduced_feature_set), test_data.shape[1], reduced_feature_set))
Note that in the previous section we reduced features in batches with the help of correlation matrices, scatter matrices and bar plots. In this section we will compute the correlation over the 'reduced_feature_set' as a whole to find correlations across the larger set.
# Append the target variable to get the correlation with the same.
if "SalePrice" not in reduced_feature_set:
reduced_feature_set.append("SalePrice")
correlation = train_data[reduced_feature_set].corr()
plt.subplots(figsize=(40, 30))
sns.heatmap(correlation.round(2), annot=True, annot_kws={"size": 17}, cmap="coolwarm", robust=True, vmax=0.6)
Let's arrange the correlation with the target variable 'SalePrice' in descending order.
correlation_with_sale_price = correlation.round(4).SalePrice.sort_values(ascending=False).to_frame().drop("SalePrice", axis=0)
correlation_with_sale_price.plot(kind='bar', title="Correlation with SalePrice", fontsize=15, figsize=(20, 8))
Let's get all the unique pairs of features with correlation greater than 0.5, not including the target variable. We also record each feature's correlation with the target variable, so that the difference between the two identifies their relative importance. This can then be used to decide which feature of a pair to remove.
Note: We should remove a feature in a pair if it's correlation with the target variable is much less than that of the other feature.
(A, B) > 0.5; (A, Target) = x; (B, Target) = y; Remove A if x is much less than y and vice versa.
correlation_greater_than_half = []
# Iterate over correlation dataframe and construct a dictionary with relevant details
for feature, row in correlation.round(4).iterrows():
for col in correlation.columns:
# Exclude target variable since it's already calculated above.
if feature != "SalePrice" and col != "SalePrice":
if feature == col:
# Avoid duplicates.
break
if row[col] > 0.5:
# Only get correlation pairs with correlation greater than half.
correlation_feature_one_target = correlation_with_sale_price.loc[feature].values[0]
correlation_feature_two_target = correlation_with_sale_price.loc[col].values[0]
matrix_data = {
"feature_one" : feature,
"feature_two" : col,
"correlation_feature_pair" : row[col],
"correlation_feature_one_target" : correlation_feature_one_target,
"correlation_feature_two_target" : correlation_feature_two_target,
"difference_in_correlation_target" : np.abs(correlation_feature_one_target - correlation_feature_two_target)
}
correlation_greater_than_half.append(matrix_data)
correlation_greater_than_half_df = pd.DataFrame.from_dict(correlation_greater_than_half)
display(correlation_greater_than_half_df)
# Filter correlation pair which have difference in correlation with target variable greater than 0.2
display(correlation_greater_than_half_df.loc[correlation_greater_than_half_df["difference_in_correlation_target"] > 0.2])
Using the above table we can further reduce the feature set by removing 'HeatingQC', 'GarageQual', 'BsmtQual', 'FullBath', 'GarageFinish' and 'YearBuilt'.
candidate_feature_removal = ['HeatingQC', 'GarageQual', 'BsmtQual', 'FullBath', 'GarageFinish', 'YearBuilt' ]
reduced_feature_set = [x for x in reduced_feature_set if x not in candidate_feature_removal]
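As an aside, the pair extraction implemented with the nested loop above can also be written with an upper-triangle mask, which yields each unordered pair exactly once. A sketch on a small synthetic frame (toy columns, not the competition data):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base * 0.9 + rng.normal(scale=0.3, size=200),  # correlated with "a"
    "c": rng.normal(size=200),                           # independent
})
corr = toy.corr()
# Keep only cells strictly above the diagonal, so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.5]
```

`DataFrame.where` with `np.triu(..., k=1)` blanks the diagonal and lower triangle, and `stack()` drops those NaN cells, leaving one entry per unique pair; here only ("a", "b") survives the 0.5 threshold.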
Save the reduced training and testing set to the disk.
# Save the updated training set to file
filename = "chkp03_train.csv"
is_present = glob.glob(filename)
if not is_present:
train_data[reduced_feature_set].to_csv(filename, index=True)
print("Reduced training set written to file: {}".format(filename))
else:
print("File: {} already present. Make sure to delete the same when running the whole notebook.".format(filename))
# Save the updated testing set to file
filename = "chkp03_test.csv"
is_present = glob.glob(filename)
if not is_present:
test_data.to_csv(filename, index=True)
print("Testing set written to file: {}".format(filename))
else:
print("File: {} already present. Make sure to delete the same when running the whole notebook.".format(filename))
This brings us to the end of our second phase of feature reduction.
# Remove the target variable from reduced feature set since it was added for correlation heatmap.
try:
reduced_feature_set.remove("SalePrice")
print("Target variable SalePrice removed from the reduced_feature_set.")
except ValueError as e:
print("SalePrice already removed from the reduced_feature_set: {}".format(e))
print("{} out of {} features selected for removal: {}".format(len(candidate_feature_removal), test_data.shape[1], candidate_feature_removal))
print("-" * 100)
print("Remaining {} out of {} features selected for the reduced feature set: {}".format(len(reduced_feature_set), test_data.shape[1], reduced_feature_set))
This checkpoint serves the updated training and testing sets with the reduced feature set.
Load the training set with only the required features while explicitly specifying the data type for each feature.
column_types = { 'Id' : np.int64, 'YrSold' : np.int64, 'MoSold' : np.int64, 'LotArea' : np.float64, 'BedroomAbvGr' : np.int64, 'SalePrice' : np.float64 }
# Read the training set with only the relevant features required by the benchmark model.
try:
train_data = pd.read_csv("chkp02_train.csv", dtype=column_types, usecols=column_types.keys())
train_data.set_index('Id', inplace=True)
print("Training set has {} samples with {} features each.".format(*train_data.shape))
except Exception as e:
print("Dataset could not be loaded. Error: {}".format(e))
print("Please make sure the training and the testing set is present.")
Let's glance at the training data to make sure the data is read properly.
display(train_data.head())
Let's separate our target variable from the training set.
saleprice = train_data.SalePrice
train_data.drop("SalePrice", axis=1, inplace=True)
Missing values in the dataset.
print("There are {} missing values in the training set.".format(train_data.isnull().sum().sum()))
Log transformation for training features.
# Calculate skewness for numerical features and sort them in descending order.
skewness = train_data.apply(lambda x: np.abs(skew(x))).to_frame(name="skewness_value").sort_values(by="skewness_value", ascending=False)
display(skewness)
# Filter out skewed features which have skewness value greater than 0.75
skewed_features = skewness.loc[skewness.skewness_value > 0.75].index
print("{} feature(s) selected for log transformation: {}".format(len(skewed_features), skewed_features.tolist()))
Apply log transformation for training features.
train_data[skewed_features] = train_data[skewed_features].apply(lambda x: np.log1p(x))
Verify log transformation for training features.
display(train_data.loc[:, skewed_features].sample(5))
Log transformation for target variable.
print("Feature: SalePrice, Skewness: {}".format(saleprice.skew()))
Apply log transformation for target variable.
saleprice = saleprice.apply(lambda x: np.log1p(x))
Verify log transformation for target variable.
saleprice.sample(5)
One Hot Encoding for Categorical (Nominal) features.
Not applicable since there are no such features in the current training set.
Let's split the training set into training and validation set.
# 80 percent to training set and 20 percent to validation set.
X_train, X_test, y_train, y_test = train_test_split(train_data, saleprice, test_size = 0.2, random_state = 1729)
print("Training set has {} samples with {} features.".format(*X_train.shape))
print("Validation set has {} samples with {} features.".format(*X_test.shape))
Let's build the Linear Regression model and evaluate it using RMSLE.
start = time()
regressional_model = LinearRegression().fit(X_train, y_train)
train_time = time() - start
start = time()
y_pred = regressional_model.predict(X_test)
pred_time = time() - start
# Model metrics will be used to compare models.
model_metrics = []
# Calculate the cross validation score for the current regression model over training set.
cv_scores = np.sqrt(-cross_val_score(regressional_model, X_train, y_train, scoring="neg_mean_squared_error", cv = 10))
# Calculate the RMSLE for the current regression model over validation set.
score = np.sqrt(mean_squared_error(y_test, y_pred))
print("Benchmark Model: RMSLE on Training set: {:.4f} (+/- {:.4f})".format(cv_scores.mean(), cv_scores.std()))
print("Benchmark Model: RMSLE on Validation set: {:.4f}".format(score))
model_metrics.append({ "model" : "Benchmark", "rmsle_training" : cv_scores.mean(), "rmsle_validation" : score,
"training_time": train_time, "predicting_time": pred_time })
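A note on the metric: because the target was log1p-transformed, the plain RMSE computed above on the transformed values is by definition the RMSLE on the original prices. A quick check with made-up prices (illustrative values only):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([200000.0, 150000.0, 320000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0])

# RMSLE written out from its definition.
rmsle = np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))
# RMSE computed in log1p space, as done in the cells above.
rmse_on_logs = np.sqrt(mean_squared_error(np.log1p(y_true), np.log1p(y_pred)))
```

The two quantities are identical, which is why fitting on `np.log1p(SalePrice)` and scoring with RMSE matches the competition's evaluation metric.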
In this section we will evaluate the regression models below, trained on the reduced feature set, using the same metric as above.
Load the dataset, explicitly specifying the data type for each feature.
column_types = {
'MSSubClass' : np.int64, 'MSZoning' : object, 'LotFrontage' : np.float64, 'LotArea' : np.float64, 'Street' : np.int64,
'Alley' : np.int64, 'LotShape' : np.int64, 'LandContour' : object, 'Utilities' : np.int64, 'LotConfig' : object,
'LandSlope' : np.int64, 'Neighborhood' : object, 'Condition1' : object, 'Condition2' : object, 'BldgType' : object,
'HouseStyle' : object, 'OverallQual' : np.int64, 'OverallCond' : np.int64, 'YearBuilt' : np.int64, 'YearRemodAdd' : np.int64,
'RoofStyle' : object, 'RoofMatl' : object, 'Exterior1st' : object, 'Exterior2nd' : object, 'MasVnrType' : object,
'MasVnrArea' : np.float64, 'ExterQual' : np.int64, 'ExterCond' : np.int64, 'Foundation' : object, 'BsmtQual' : np.int64,
'BsmtCond' : np.int64, 'BsmtExposure' : np.int64, 'BsmtFinType1' : np.int64, 'BsmtFinSF1' : np.float64, 'BsmtFinType2' : np.int64,
'BsmtFinSF2' : np.float64, 'BsmtUnfSF' : np.float64, 'TotalBsmtSF' : np.float64, 'Heating' : object, 'HeatingQC' : np.int64,
'CentralAir' : np.int64, 'Electrical' : object, '1stFlrSF' : np.float64, '2ndFlrSF' : np.float64, 'LowQualFinSF' : np.float64,
'GrLivArea' : np.float64, 'BsmtFullBath' : np.int64, 'BsmtHalfBath' : np.int64, 'FullBath' : np.int64, 'HalfBath' : np.int64,
'BedroomAbvGr' : np.int64, 'KitchenAbvGr' : np.int64, 'KitchenQual' : np.int64, 'TotRmsAbvGrd' : np.int64, 'Functional' : np.int64,
'Fireplaces' : np.int64, 'FireplaceQu' : np.int64, 'GarageType' : object, 'GarageYrBlt' : object, 'GarageFinish' : np.int64,
'GarageCars' : np.int64, 'GarageArea' : np.float64, 'GarageQual' : np.int64, 'GarageCond' : np.int64, 'PavedDrive' : np.int64,
'WoodDeckSF' : np.float64, 'OpenPorchSF' : np.float64, 'EnclosedPorch' : np.float64, '3SsnPorch' : np.float64, 'ScreenPorch' : np.float64,
'PoolArea' : np.float64, 'PoolQC' : np.int64, 'Fence' : np.int64, 'MiscFeature' : object, 'MiscVal' : np.float64,
'MoSold' : np.int64, 'YrSold' : np.int64, 'SaleType' : object, 'SaleCondition' : object, 'SalePrice' : np.float64
}
# Read the training and testing set to pandas dataframes.
try:
train_data = pd.read_csv("chkp03_train.csv", dtype=column_types)
test_data = pd.read_csv("chkp03_test.csv", dtype=column_types)
train_data.set_index('Id', inplace=True)
test_data.set_index('Id', inplace=True)
print("Training set has {} samples with {} features each.".format(*train_data.shape))
print("Testing set has {} samples with {} features each.".format(*test_data.shape))
except Exception as e:
print("Dataset could not be loaded. Error: {}".format(e))
print("Please make sure the training and the testing set is present.")
Let's separate our target variable from the training set.
saleprice = train_data.SalePrice
train_data.drop("SalePrice", axis=1, inplace=True)
Since our training and testing sets have unequal numbers of features because of the feature reduction in the sections above, we need to select the same subset of columns in the testing set.
reduced_feature_set = train_data.columns
test_data = test_data.loc[:, reduced_feature_set]
print("Training set has {} samples with {} features each.".format(*train_data.shape))
print("Testing set has {} samples with {} features each.".format(*test_data.shape))
Let's have a glance at our training and testing set.
display(train_data.head())
display(test_data.head())
Missing values in the updated dataset.
print("There are {} missing values in the training set.".format(train_data.isnull().sum().sum()))
print("There are {} missing values in the testing set.".format(test_data.isnull().sum().sum()))
Log transformation for training features.
# Get the numerical feature in the reduced feature set.
numerical_features = train_data.select_dtypes(include=np.number).columns
# Calculate skewness for numerical features and sort them in descending order.
skewness = train_data.loc[:, numerical_features].apply(lambda x: np.abs(skew(x))).to_frame(name="skewness_value").sort_values(by="skewness_value", ascending=False)
display(skewness)
# Filter out skewed features which have skewness value greater than 0.75
skewed_features = skewness.loc[skewness.skewness_value > 0.75].index
print("{} feature(s) selected for log transformation: {}".format(len(skewed_features), skewed_features.tolist()))
Apply log transformation for training features.
train_data[skewed_features] = train_data[skewed_features].apply(lambda x: np.log1p(x))
Verify log transformation for training features.
display(train_data.loc[:, skewed_features].sample(5))
Log transformation for target variable.
print("Feature: SalePrice, Skewness: {}".format(saleprice.skew()))
Apply log transformation for target variable.
saleprice = saleprice.apply(lambda x: np.log1p(x))
Verify log transformation for target variable.
saleprice.sample(5)
One Hot Encoding for Categorical (Nominal) features.
Note: Values for a feature can differ between the training and testing set, so one-hot encoding may produce unequal numbers of features when run on each set individually. To avoid this, combine the training and testing set, run pd.get_dummies() on the combined dataset, and then reconstruct the training and testing set.
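The mismatch described in the note is easy to reproduce on toy frames (hypothetical 'Color' values, not competition columns):

```python
import pandas as pd

train_toy = pd.DataFrame({"Color": ["Red", "Blue", "Red"]})
test_toy = pd.DataFrame({"Color": ["Red", "Green"]})

# Encoding each set separately yields different column sets:
# the training set never sees 'Green', the testing set never sees 'Blue'.
sep_train = pd.get_dummies(train_toy)
sep_test = pd.get_dummies(test_toy)
mismatch = set(sep_train.columns) != set(sep_test.columns)

# Encoding the concatenation, then splitting by row count, keeps columns aligned.
combined = pd.get_dummies(pd.concat([train_toy, test_toy]))
enc_train = combined[: len(train_toy)]
enc_test = combined[len(train_toy):]
aligned = list(enc_train.columns) == list(enc_test.columns)
```

The combine-then-split approach is what the cell below applies to the real training and testing sets.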
print("Before One Hot Encoding - Training set has {} samples with {} features each.".format(*train_data.shape))
print("Before One Hot Encoding - Testing set has {} samples with {} features each.".format(*test_data.shape))
print("-" * 100)
# Need the number of rows in training set to split the combined dataset properly.
n_rows = train_data.shape[0]
combined_data = pd.concat([train_data, test_data])
print("[IN_PROCESS] Before One Hot Encoding - Combined set has {} samples with {} features each.".format(*combined_data.shape))
combined_data = pd.get_dummies(combined_data)
print("[IN_PROCESS] After One Hot Encoding - Combined set has {} samples with {} features each.".format(*combined_data.shape))
train_data = combined_data[:n_rows]
test_data = combined_data[n_rows:]
print("-" * 100)
print("After One Hot Encoding - Training set has {} samples with {} features each.".format(*train_data.shape))
print("After One Hot Encoding - Testing set has {} samples with {} features each.".format(*test_data.shape))
Let's split the training set into a training and a validation set.
# 80 percent to training set and 20 percent to validation set.
X_train, X_test, y_train, y_test = train_test_split(train_data, saleprice, test_size = 0.2, random_state = 1729)
print("Training set has {} samples with {} features.".format(*X_train.shape))
print("Validation set has {} samples with {} features.".format(*X_test.shape))
Now that our dataset is ready for regression, let's build the models.
LinearRegression
start = time()
regressional_model = LinearRegression().fit(X_train, y_train)
train_time = time() - start
start = time()
y_pred = regressional_model.predict(X_test)
pred_time = time() - start
cv_scores = np.sqrt(-cross_val_score(regressional_model, X_train, y_train, scoring="neg_mean_squared_error", cv = 10))
score = np.sqrt(mean_squared_error(y_test, y_pred))
print("LinearRegression Model: RMSLE on Training set: {:.4f} (+/- {:.4f})".format(cv_scores.mean(), cv_scores.std()))
print("LinearRegression Model: RMSLE on Validation set: {:.4f}".format(score))
model_metrics.append({ "model" : "LinearRegression", "rmsle_training" : cv_scores.mean(), "rmsle_validation" : score,
"training_time": train_time, "predicting_time": pred_time })
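A note on the metric names above: because SalePrice was log1p-transformed, the plain RMSE computed on the transformed target equals the RMSLE on the original price scale. A quick check with hypothetical prices:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed and predicted prices.
p_true = np.array([100000., 150000., 200000.])
p_pred = np.array([110000., 140000., 215000.])

# RMSE in log1p space (what the code above computes)...
rmse_log = np.sqrt(mean_squared_error(np.log1p(p_true), np.log1p(p_pred)))
# ...equals RMSLE computed directly on the price scale.
rmsle = np.sqrt(np.mean((np.log1p(p_pred) - np.log1p(p_true)) ** 2))
print(np.isclose(rmse_log, rmsle))  # True
```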
GradientBoostingRegressor
start = time()
regressional_model = GradientBoostingRegressor(random_state=1729).fit(X_train, y_train)
train_time = time() - start
start = time()
y_pred = regressional_model.predict(X_test)
pred_time = time() - start
# Calculate the cross validation score for the current regression model over training set.
cv_scores = np.sqrt(-cross_val_score(regressional_model, X_train, y_train, scoring="neg_mean_squared_error", cv = 10))
# Calculate the RMSLE for the current regression model over validation set.
score = np.sqrt(mean_squared_error(y_test, y_pred))
print("GradientBoostingRegressor Model: RMSLE on Training set: {:.4f} (+/- {:.4f})".format(cv_scores.mean(), cv_scores.std()))
print("GradientBoostingRegressor Model: RMSLE on Validation set: {:.4f}".format(score))
model_metrics.append({ "model" : "GradientBoostingRegressor", "rmsle_training" : cv_scores.mean(), "rmsle_validation" : score,
"training_time": train_time, "predicting_time": pred_time })
LightGBM
regressional_model = lgb.LGBMRegressor(objective='regression', random_state=1729, silent=True)
start = time()
regressional_model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
train_time = time() - start
start = time()
y_pred = regressional_model.predict(X_test)
pred_time = time() - start
# Calculate the cross validation score for the current regression model over training set.
cv_scores = np.sqrt(-cross_val_score(regressional_model, X_train, y_train, scoring="neg_mean_squared_error", cv = 10))
# Calculate the RMSLE for the current regression model over validation set.
score = np.sqrt(mean_squared_error(y_test, y_pred))
print("LGBMRegressor Model: RMSLE on Training set: {:.4f} (+/- {:.4f})".format(cv_scores.mean(), cv_scores.std()))
print("LGBMRegressor Model: RMSLE on Validation set: {:.4f}".format(score))
model_metrics.append({ "model" : "LGBMRegressor", "rmsle_training" : cv_scores.mean(), "rmsle_validation" : score,
"training_time": train_time, "predicting_time": pred_time })
In this section we will select the best model based on the metrics above and tune its hyperparameters to reduce the RMSLE further.
model_metrics_df = pd.DataFrame().from_dict(model_metrics)
model_metrics_df.set_index("model", inplace=True)
display(model_metrics_df.sort_values(by="rmsle_validation").round(4))
From the above table it is clear that GradientBoostingRegressor performs best on our dataset, since it has the lowest RMSLE on both the training and validation sets. Let's find the optimal parameters using GridSearchCV.
regressional_model = GradientBoostingRegressor(random_state=1729)
# Set the values for hyper parameters
hyper_parameters = {
'n_estimators': [200, 400, 800, 1600, 3200],
'learning_rate': [1.0, 0.5, 0.05, 0.025, 0.001],
'max_depth': [2, 3, 4],
'min_samples_split': [2, 4],
'min_samples_leaf': [2, 4, 8, 16],
'max_features': [16, 32, "sqrt"],
}
# Initialize the grid object specifying the above; also add the score to gauge models on.
grid_obj = GridSearchCV(regressional_model, hyper_parameters, scoring="neg_mean_squared_error", n_jobs=2)
# Fit the grid object.
start = time()
grid_fit = grid_obj.fit(X_train, y_train)
train_time = time() - start
print("Time taken for GridSearchCV: {:.2f} seconds".format(train_time))
gbr_best_parameters = grid_fit.best_params_
print("GradientBoostingRegressor (Optimized): {}".format(gbr_best_parameters))
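For a sense of the search cost: GridSearchCV enumerates the full Cartesian product of the lists above, and each candidate is refit once per CV fold, so the lists should be kept deliberately small. A sketch of the count for this grid:

```python
# Sizes of the hyperparameter lists defined above:
# n_estimators, learning_rate, max_depth, min_samples_split,
# min_samples_leaf, max_features.
list_sizes = [5, 5, 3, 2, 4, 3]
n_candidates = 1
for size in list_sizes:
    n_candidates *= size
print(n_candidates)  # 1800 candidate parameter settings, each refit per fold
```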
GradientBoostingRegressor (Optimized)
start = time()
regressional_model_best = GradientBoostingRegressor(random_state=1729, n_estimators=3200, learning_rate=0.025,
max_depth=2, min_samples_leaf=4, min_samples_split=2, max_features=16).fit(X_train, y_train)
train_time = time() - start
# Predict values from the best regressional model.
start = time()
y_pred = regressional_model_best.predict(X_test)
pred_time = time() - start
# Calculate the cross validation score for the current regression model over training set.
cv_scores = np.sqrt(-cross_val_score(regressional_model_best, X_train, y_train, scoring="neg_mean_squared_error", cv = 10))
# Calculate the RMSLE for the current regression model over validation set.
score = np.sqrt(mean_squared_error(y_test, y_pred))
print("GradientBoostingRegressor (Optimized) Model: RMSLE on Training set: {:.4f} (+/- {:.4f})".format(cv_scores.mean(), cv_scores.std()))
print("GradientBoostingRegressor (Optimized) Model: RMSLE on Validation set: {:.4f}".format(score))
model_metrics.append({ "model" : "GradientBoostingRegressor (Optimized)", "rmsle_training" : cv_scores.mean(), "rmsle_validation" : score,
"training_time": train_time, "predicting_time": pred_time })
model_metrics_df = pd.DataFrame().from_dict(model_metrics)
model_metrics_df.set_index("model", inplace=True)
display(model_metrics_df.sort_values(by="rmsle_validation").round(4))
Now that we have our best regressional model let's see which are the most important features for this model.
# Get the feature importance weights.
important_features = regressional_model_best.feature_importances_
# Join the features with their appropriate weights.
important_features_df = pd.concat([pd.Series(data=train_data.columns.tolist()), pd.Series(data=important_features)], axis=1, keys=["feature", "importance"])
important_features_df.set_index("feature", inplace=True)
# Keep features with non-zero importance and sort them in ascending order.
important_features_df = important_features_df.loc[important_features_df.importance > 0].sort_values(by="importance", ascending=True)
# Since the values are sorted in ascending order, plot the last 25 features to get the top 25 most important features.
important_features_df.tail(25).plot(kind="barh", figsize=(20, 15))
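For interpreting the chart: scikit-learn normalizes feature_importances_ so the weights are non-negative and sum to 1, meaning each bar reads as a share of the model's total impurity reduction. A toy fit on synthetic data (hypothetical, not the housing data) shows this:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression problem with 5 features.
rng = np.random.RandomState(1729)
X = rng.rand(200, 5)
y = 3 * X[:, 0] + X[:, 1] + 0.1 * rng.rand(200)

# The importance weights of a fitted model sum to 1.
importances = GradientBoostingRegressor(random_state=1729).fit(X, y).feature_importances_
print(np.isclose(importances.sum(), 1.0))  # True
```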
Predicted SalePrice vs. Observed SalePrice
saleprices = pd.DataFrame()
saleprices["Observed"] = np.expm1(y_test)
saleprices["Predicted"] = np.expm1(y_pred)
price_correlation = saleprices["Observed"].corr(saleprices["Predicted"])
saleprices.plot(x="Observed", y="Predicted", kind="scatter", title="Predicted SalePrice vs. Observed SalePrice (Correlation={:.4f})".format(price_correlation), figsize=(15, 15))
Final prediction on the Testing set.
# Since the target variable was log transformed, use expm1 to invert the transformation and recover the original scale.
y_pred_test = np.expm1(regressional_model_best.predict(test_data))
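expm1 is the exact inverse of the log1p applied earlier, so the round trip recovers the original price scale (illustrated with hypothetical prices):

```python
import numpy as np

# log1p followed by expm1 is the identity up to floating point,
# so predictions come back in dollars.
prices = np.array([125000.0, 180000.0, 755000.0])
recovered = np.expm1(np.log1p(prices))
print(np.allclose(recovered, prices))  # True
```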
Let's have a look at 10 samples from the predictions.
submission_df = pd.concat([pd.Series(data=test_data.index.tolist()), pd.Series(y_pred_test)], axis=1, keys=["Id", "SalePrice"]).round(4)
display(submission_df.sample(10))
Save the predictions to disk.
submission_df.to_csv("gbr_predictions.csv", index=False)